Analyzing New York City 311 Customer Service from 2016 to Present
In [2]:
#import pandas library aliased as pd
import pandas as pd
# Initialize df variable as dataframe
df = pd.DataFrame()
print(df)
Empty DataFrame Columns: [] Index: []
1. Import the dataset stored in 'Customer Service_Requests_from_2010_to_Present.csv'¶
In [4]:
#store csv data into a data frame
df = pd.read_csv("Customer Service_Requests_from_2010_to_Present.csv", low_memory=False)
#print first few data in the dataframe
print(df.head())
Unique Key Created Date Closed Date Agency \
0 32310363 12/31/2015 11:59:45 PM 01-01-16 0:55 NYPD
1 32309934 12/31/2015 11:59:44 PM 01-01-16 1:26 NYPD
2 32309159 12/31/2015 11:59:29 PM 01-01-16 4:51 NYPD
3 32305098 12/31/2015 11:57:46 PM 01-01-16 7:43 NYPD
4 32306529 12/31/2015 11:56:58 PM 01-01-16 3:24 NYPD
Agency Name Complaint Type \
0 New York City Police Department Noise - Street/Sidewalk
1 New York City Police Department Blocked Driveway
2 New York City Police Department Blocked Driveway
3 New York City Police Department Illegal Parking
4 New York City Police Department Illegal Parking
Descriptor Location Type Incident Zip \
0 Loud Music/Party Street/Sidewalk 10034.0
1 No Access Street/Sidewalk 11105.0
2 No Access Street/Sidewalk 10458.0
3 Commercial Overnight Parking Street/Sidewalk 10461.0
4 Blocked Sidewalk Street/Sidewalk 11373.0
Incident Address ... Bridge Highway Name Bridge Highway Direction \
0 71 VERMILYEA AVENUE ... NaN NaN
1 27-07 23 AVENUE ... NaN NaN
2 2897 VALENTINE AVENUE ... NaN NaN
3 2940 BAISLEY AVENUE ... NaN NaN
4 87-14 57 ROAD ... NaN NaN
Road Ramp Bridge Highway Segment Garage Lot Name Ferry Direction \
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
Ferry Terminal Name Latitude Longitude \
0 NaN 40.865682 -73.923501
1 NaN 40.775945 -73.915094
2 NaN 40.870325 -73.888525
3 NaN 40.835994 -73.828379
4 NaN 40.733060 -73.874170
Location
0 (40.86568153633767, -73.92350095571744)
1 (40.775945312321085, -73.91509393898605)
2 (40.870324522111424, -73.88852464418646)
3 (40.83599404683083, -73.82837939584206)
4 (40.733059618956815, -73.87416975810375)
[5 rows x 53 columns]
2. Provide insight on the information and details that the provided dataset carries.¶
In [6]:
# Information of the dataframe
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 300698 entries, 0 to 300697 Data columns (total 53 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unique Key 300698 non-null int64 1 Created Date 300698 non-null object 2 Closed Date 298534 non-null object 3 Agency 300698 non-null object 4 Agency Name 300698 non-null object 5 Complaint Type 300698 non-null object 6 Descriptor 294784 non-null object 7 Location Type 300567 non-null object 8 Incident Zip 298083 non-null float64 9 Incident Address 256288 non-null object 10 Street Name 256288 non-null object 11 Cross Street 1 251419 non-null object 12 Cross Street 2 250919 non-null object 13 Intersection Street 1 43858 non-null object 14 Intersection Street 2 43362 non-null object 15 Address Type 297883 non-null object 16 City 298084 non-null object 17 Landmark 349 non-null object 18 Facility Type 298527 non-null object 19 Status 300698 non-null object 20 Due Date 300695 non-null object 21 Resolution Description 300698 non-null object 22 Resolution Action Updated Date 298511 non-null object 23 Community Board 300698 non-null object 24 Borough 300698 non-null object 25 X Coordinate (State Plane) 297158 non-null float64 26 Y Coordinate (State Plane) 297158 non-null float64 27 Park Facility Name 300698 non-null object 28 Park Borough 300698 non-null object 29 School Name 300698 non-null object 30 School Number 300698 non-null object 31 School Region 300697 non-null object 32 School Code 300697 non-null object 33 School Phone Number 300698 non-null object 34 School Address 300698 non-null object 35 School City 300698 non-null object 36 School State 300698 non-null object 37 School Zip 300697 non-null object 38 School Not Found 300698 non-null object 39 School or Citywide Complaint 0 non-null float64 40 Vehicle Type 0 non-null float64 41 Taxi Company Borough 0 non-null float64 42 Taxi Pick Up Location 0 non-null float64 43 Bridge Highway Name 243 non-null object 44 Bridge Highway Direction 243 non-null object 45 Road Ramp 213 non-null object 46 Bridge Highway Segment 213 non-null object 47 Garage Lot Name 0 non-null float64 48 Ferry Direction 1 non-null object 49 Ferry Terminal Name 2 non-null object 50 Latitude 297158 non-null float64 51 Longitude 297158 non-null float64 52 Location 297158 non-null object dtypes: float64(10), int64(1), object(42) memory usage: 121.6+ MB
In [7]:
# Dimension of the dataframe
print(f'Dimension: {df.shape}')
Dimension: (300698, 53)
In [8]:
# Description of the dataframe
df.describe()
Out[8]:
| Unique Key | Incident Zip | X Coordinate (State Plane) | Y Coordinate (State Plane) | School or Citywide Complaint | Vehicle Type | Taxi Company Borough | Taxi Pick Up Location | Garage Lot Name | Latitude | Longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3.006980e+05 | 298083.000000 | 2.971580e+05 | 297158.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 297158.000000 | 297158.000000 |
| mean | 3.130054e+07 | 10848.888645 | 1.004854e+06 | 203754.534416 | NaN | NaN | NaN | NaN | NaN | 40.725885 | -73.925630 |
| std | 5.738547e+05 | 583.182081 | 2.175338e+04 | 29880.183529 | NaN | NaN | NaN | NaN | NaN | 0.082012 | 0.078454 |
| min | 3.027948e+07 | 83.000000 | 9.133570e+05 | 121219.000000 | NaN | NaN | NaN | NaN | NaN | 40.499135 | -74.254937 |
| 25% | 3.080118e+07 | 10310.000000 | 9.919752e+05 | 183343.000000 | NaN | NaN | NaN | NaN | NaN | 40.669796 | -73.972142 |
| 50% | 3.130436e+07 | 11208.000000 | 1.003158e+06 | 201110.500000 | NaN | NaN | NaN | NaN | NaN | 40.718661 | -73.931781 |
| 75% | 3.178446e+07 | 11238.000000 | 1.018372e+06 | 224125.250000 | NaN | NaN | NaN | NaN | NaN | 40.781840 | -73.876805 |
| max | 3.231065e+07 | 11697.000000 | 1.067173e+06 | 271876.000000 | NaN | NaN | NaN | NaN | NaN | 40.912869 | -73.700760 |
3. Convert the columns "Created Date" and "Closed Date" to datetime datatype and create a new column "Request_Closing_Time" as the time elapsed between request creation and request closing.¶
Converting 'Created Date' and 'Closed Date' column to datetime¶
In [11]:
# Print 'Created Date' column data type
print("'Created Date' Data type: " + str(df['Created Date'].dtype))
# Print 'Closed Date' data type
print("'Closed Date' Data type: " + str(df['Closed Date'].dtype))
'Created Date' Data type: object 'Closed Date' Data type: object
In [12]:
#convert Created Date column to date time
df['Created Date'] = pd.to_datetime(df['Created Date'], dayfirst=True, format="mixed")
#convert Closed Date column to date time
df['Closed Date'] = pd.to_datetime(df['Closed Date'], dayfirst=True, format="mixed")
In [13]:
# Print 'Created Date' column data type
print("'Created Date' Data type: " + str(df['Created Date'].dtype))
# Print 'Closed Date' data type
print("'Closed Date' Data type: " + str(df['Closed Date'].dtype))
'Created Date' Data type: datetime64[ns] 'Closed Date' Data type: datetime64[ns]
New column creation: 'Request_Closing_Time'¶
In [15]:
# Creating new column Request_Closing_Time
df['Request_Closing_Time'] = df['Closed Date'] - df['Created Date']
In [16]:
# Output of the changed columns
df[['Created Date', 'Closed Date', 'Request_Closing_Time']]
Out[16]:
| Created Date | Closed Date | Request_Closing_Time | |
|---|---|---|---|
| 0 | 2015-12-31 23:59:45 | 2016-01-01 00:55:00 | 0 days 00:55:15 |
| 1 | 2015-12-31 23:59:44 | 2016-01-01 01:26:00 | 0 days 01:26:16 |
| 2 | 2015-12-31 23:59:29 | 2016-01-01 04:51:00 | 0 days 04:51:31 |
| 3 | 2015-12-31 23:57:46 | 2016-01-01 07:43:00 | 0 days 07:45:14 |
| 4 | 2015-12-31 23:56:58 | 2016-01-01 03:24:00 | 0 days 03:27:02 |
| ... | ... | ... | ... |
| 300693 | 2015-03-29 00:33:41 | NaT | NaT |
| 300694 | 2015-03-29 00:33:28 | 2015-03-29 02:33:59 | 0 days 02:00:31 |
| 300695 | 2015-03-29 00:33:03 | 2015-03-29 03:40:20 | 0 days 03:07:17 |
| 300696 | 2015-03-29 00:33:02 | 2015-03-29 04:38:35 | 0 days 04:05:33 |
| 300697 | 2015-03-29 00:33:01 | 2015-03-29 04:41:50 | 0 days 04:08:49 |
300698 rows × 3 columns
4. Drop irrelevant Columns which are stored below in column_names.¶
In [18]:
# Store column names in a variable
column_names = ['Agency Name','Incident Address','Street Name','Cross Street 1','Cross Street 2',
'Intersection Street 1', 'Intersection Street 2','Address Type','Park Facility Name','Park Borough',
'School Name', 'School Number','School Region','School Code','School Phone Number','School Address',
'School City','School State','School Zip','School Not Found','School or Citywide Complaint','Vehicle Type',
'Taxi Company Borough','Taxi Pick Up Location','Bridge Highway Name','Bridge Highway Direction',
'Road Ramp','Bridge Highway Segment','Garage Lot Name','Ferry Direction','Ferry Terminal Name','Landmark',
'X Coordinate (State Plane)','Y Coordinate (State Plane)','Due Date','Resolution Action Updated Date',
'Community Board','Facility Type','Location']
In [19]:
# Display all columns in df
print("Column Name")
print("_" * 40)
for column in df.columns:
print(column)
Column Name ________________________________________ Unique Key Created Date Closed Date Agency Agency Name Complaint Type Descriptor Location Type Incident Zip Incident Address Street Name Cross Street 1 Cross Street 2 Intersection Street 1 Intersection Street 2 Address Type City Landmark Facility Type Status Due Date Resolution Description Resolution Action Updated Date Community Board Borough X Coordinate (State Plane) Y Coordinate (State Plane) Park Facility Name Park Borough School Name School Number School Region School Code School Phone Number School Address School City School State School Zip School Not Found School or Citywide Complaint Vehicle Type Taxi Company Borough Taxi Pick Up Location Bridge Highway Name Bridge Highway Direction Road Ramp Bridge Highway Segment Garage Lot Name Ferry Direction Ferry Terminal Name Latitude Longitude Location Request_Closing_Time
In [20]:
# Display total columns in df
print("Total columns: ", len(df.columns))
Total columns: 54
In [21]:
# Modify the dataframe 'df' by dropping all columns listed in the variable
df.drop(columns=column_names, inplace=True)
In [22]:
# Display all columns in df
print("Column Name")
print("_" * 40)
for column in df.columns:
print(column)
Column Name ________________________________________ Unique Key Created Date Closed Date Agency Complaint Type Descriptor Location Type Incident Zip City Status Resolution Description Borough Latitude Longitude Request_Closing_Time
In [23]:
# Display total columns in df
print("Total columns: ", len(df.columns))
Total columns: 15
5. Remove the NaN missing values from updated dataframe.¶
In [25]:
# Few datas from the last
df.tail()
Out[25]:
| Unique Key | Created Date | Closed Date | Agency | Complaint Type | Descriptor | Location Type | Incident Zip | City | Status | Resolution Description | Borough | Latitude | Longitude | Request_Closing_Time | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 300693 | 30281872 | 2015-03-29 00:33:41 | NaT | NYPD | Noise - Commercial | Loud Music/Party | Club/Bar/Restaurant | NaN | NaN | Open | Your complaint has been forwarded to the New Y... | Unspecified | NaN | NaN | NaT |
| 300694 | 30281230 | 2015-03-29 00:33:28 | 2015-03-29 02:33:59 | NYPD | Blocked Driveway | Partial Access | Street/Sidewalk | 11418.0 | RICHMOND HILL | Closed | The Police Department responded and upon arriv... | QUEENS | 40.694077 | -73.846087 | 0 days 02:00:31 |
| 300695 | 30283424 | 2015-03-29 00:33:03 | 2015-03-29 03:40:20 | NYPD | Noise - Commercial | Loud Music/Party | Club/Bar/Restaurant | 11206.0 | BROOKLYN | Closed | The Police Department responded to the complai... | BROOKLYN | 40.699590 | -73.944234 | 0 days 03:07:17 |
| 300696 | 30280004 | 2015-03-29 00:33:02 | 2015-03-29 04:38:35 | NYPD | Noise - Commercial | Loud Music/Party | Club/Bar/Restaurant | 10461.0 | BRONX | Closed | The Police Department responded to the complai... | BRONX | 40.837708 | -73.834587 | 0 days 04:05:33 |
| 300697 | 30281825 | 2015-03-29 00:33:01 | 2015-03-29 04:41:50 | NYPD | Noise - Commercial | Loud Music/Party | Store/Commercial | 10036.0 | NEW YORK | Closed | The Police Department responded to the complai... | MANHATTAN | 40.760583 | -73.985922 | 0 days 04:08:49 |
In [26]:
# Check missing values
print("Column Name\t\tTotal NaN")
df.isna().sum()
Column Name Total NaN
Out[26]:
Unique Key 0 Created Date 0 Closed Date 2164 Agency 0 Complaint Type 0 Descriptor 5914 Location Type 131 Incident Zip 2615 City 2614 Status 0 Resolution Description 0 Borough 0 Latitude 3540 Longitude 3540 Request_Closing_Time 2164 dtype: int64
In [27]:
# Modifying the dataframe 'df' by dropping all the NaN/NaT value rows
df.dropna(inplace=True)
In [28]:
# Check missing values
print("Column Name\t\tTotal NaN")
df.isna().sum()
Column Name Total NaN
Out[28]:
Unique Key 0 Created Date 0 Closed Date 0 Agency 0 Complaint Type 0 Descriptor 0 Location Type 0 Incident Zip 0 City 0 Status 0 Resolution Description 0 Borough 0 Latitude 0 Longitude 0 Request_Closing_Time 0 dtype: int64
In [29]:
# Updated data shown from the last rows
df.tail()
Out[29]:
| Unique Key | Created Date | Closed Date | Agency | Complaint Type | Descriptor | Location Type | Incident Zip | City | Status | Resolution Description | Borough | Latitude | Longitude | Request_Closing_Time | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 300692 | 30281370 | 2015-03-29 00:34:32 | 2015-03-29 01:13:01 | NYPD | Noise - Commercial | Loud Music/Party | Store/Commercial | 10002.0 | NEW YORK | Closed | The Police Department responded to the complai... | MANHATTAN | 40.716053 | -73.991378 | 0 days 00:38:29 |
| 300694 | 30281230 | 2015-03-29 00:33:28 | 2015-03-29 02:33:59 | NYPD | Blocked Driveway | Partial Access | Street/Sidewalk | 11418.0 | RICHMOND HILL | Closed | The Police Department responded and upon arriv... | QUEENS | 40.694077 | -73.846087 | 0 days 02:00:31 |
| 300695 | 30283424 | 2015-03-29 00:33:03 | 2015-03-29 03:40:20 | NYPD | Noise - Commercial | Loud Music/Party | Club/Bar/Restaurant | 11206.0 | BROOKLYN | Closed | The Police Department responded to the complai... | BROOKLYN | 40.699590 | -73.944234 | 0 days 03:07:17 |
| 300696 | 30280004 | 2015-03-29 00:33:02 | 2015-03-29 04:38:35 | NYPD | Noise - Commercial | Loud Music/Party | Club/Bar/Restaurant | 10461.0 | BRONX | Closed | The Police Department responded to the complai... | BRONX | 40.837708 | -73.834587 | 0 days 04:05:33 |
| 300697 | 30281825 | 2015-03-29 00:33:01 | 2015-03-29 04:41:50 | NYPD | Noise - Commercial | Loud Music/Party | Store/Commercial | 10036.0 | NEW YORK | Closed | The Police Department responded to the complai... | MANHATTAN | 40.760583 | -73.985922 | 0 days 04:08:49 |
6. Show the unique values from all the columns in the dataframe.¶
In [31]:
# Loop through each column in dataframe
for column in df.columns:
# Print column name, column data, total unique values
print(f'Column Name: {column}')
print(f'Total Unique Values: {df[column].nunique()}')
print(f'Unique Values: \n{df[column].unique()}')
print('_' * 100)
Column Name: Unique Key Total Unique Values: 291107 Unique Values: [32310363 32309934 32309159 ... 30283424 30280004 30281825] ____________________________________________________________________________________________________ Column Name: Created Date Total Unique Values: 251970 Unique Values: <DatetimeArray> ['2015-12-31 23:59:45', '2015-12-31 23:59:44', '2015-12-31 23:59:29', '2015-12-31 23:57:46', '2015-12-31 23:56:58', '2015-12-31 23:56:30', '2015-12-31 23:55:32', '2015-12-31 23:54:05', '2015-12-31 23:53:58', '2015-12-31 23:52:58', ... '2015-03-29 00:42:48', '2015-03-29 00:37:15', '2015-03-29 00:35:28', '2015-03-29 00:35:23', '2015-03-29 00:35:04', '2015-03-29 00:34:32', '2015-03-29 00:33:28', '2015-03-29 00:33:03', '2015-03-29 00:33:02', '2015-03-29 00:33:01'] Length: 251970, dtype: datetime64[ns] ____________________________________________________________________________________________________ Column Name: Closed Date Total Unique Values: 231991 Unique Values: <DatetimeArray> ['2016-01-01 00:55:00', '2016-01-01 01:26:00', '2016-01-01 04:51:00', '2016-01-01 07:43:00', '2016-01-01 03:24:00', '2016-01-01 01:50:00', '2016-01-01 01:53:00', '2016-01-01 01:42:00', '2016-01-01 08:27:00', '2016-01-01 01:17:00', ... '2015-03-29 00:57:23', '2015-03-29 02:57:41', '2015-03-29 01:02:39', '2015-03-29 04:14:27', '2015-03-29 08:41:24', '2015-03-29 02:52:28', '2015-03-29 01:13:01', '2015-03-29 02:33:59', '2015-03-29 04:38:35', '2015-03-29 04:41:50'] Length: 231991, dtype: datetime64[ns] ____________________________________________________________________________________________________ Column Name: Agency Total Unique Values: 1 Unique Values: ['NYPD'] ____________________________________________________________________________________________________ Column Name: Complaint Type Total Unique Values: 15 Unique Values: ['Noise - Street/Sidewalk' 'Blocked Driveway' 'Illegal Parking' 'Derelict Vehicle' 'Noise - Commercial' 'Noise - House of Worship' 'Posting Advertisement' 'Noise - Vehicle' 'Animal Abuse' 'Vending' 'Traffic' 'Drinking' 'Noise - Park' 'Graffiti' 'Disorderly Youth'] ____________________________________________________________________________________________________ Column Name: Descriptor Total Unique Values: 41 Unique Values: ['Loud Music/Party' 'No Access' 'Commercial Overnight Parking' 'Blocked Sidewalk' 'Posted Parking Sign Violation' 'Blocked Hydrant' 'With License Plate' 'Partial Access' 'Unauthorized Bus Layover' 'Double Parked Blocking Vehicle' 'Vehicle' 'Loud Talking' 'Banging/Pounding' 'Car/Truck Music' 'Tortured' 'In Prohibited Area' 'Double Parked Blocking Traffic' 'Congestion/Gridlock' 'Neglected' 'Car/Truck Horn' 'In Public' 'Other (complaint details)' 'No Shelter' 'Truck Route Violation' 'Unlicensed' 'Overnight Commercial Storage' 'Engine Idling' 'After Hours - Licensed Est' 'Detached Trailer' 'Underage - Licensed Est' 'Chronic Stoplight Violation' 'Loud Television' 'Chained' 'Building' 'In Car' 'Police Report Requested' 'Chronic Speeding' 'Playing in Unsuitable Place' 'Drag Racing' 'Police Report Not Requested' 'Nuisance/Truant'] ____________________________________________________________________________________________________ Column Name: Location Type Total Unique Values: 14 Unique Values: ['Street/Sidewalk' 'Club/Bar/Restaurant' 'Store/Commercial' 'House of Worship' 'Residential Building/House' 'Residential Building' 'Park/Playground' 'Vacant Lot' 'House and Store' 'Highway' 'Commercial' 'Roadway Tunnel' 'Subway Station' 'Parking Lot'] ____________________________________________________________________________________________________ Column Name: Incident Zip Total Unique Values: 200 Unique Values: [10034. 11105. 10458. 10461. 11373. 11215. 10032. 10457. 11415. 11219. 11372. 10453. 11208. 11379. 11374. 11412. 11217. 11234. 10026. 10456. 10030. 10467. 11432. 10031. 11419. 10024. 11201. 11216. 10462. 11385. 11414. 11213. 11375. 11211. 10312. 10017. 11417. 10002. 10027. 11209. 10035. 11418. 11421. 11205. 10468. 11355. 11358. 11210. 11368. 11427. 11436. 10308. 11364. 10011. 11423. 11230. 10003. 11221. 11416. 11378. 11236. 11218. 10029. 10028. 11214. 11207. 11369. 11223. 11220. 10302. 11420. 11354. 10473. 10301. 11103. 10465. 11377. 11212. 11365. 10472. 10452. 11203. 10469. 11237. 11434. 11101. 10460. 11229. 11206. 11102. 10466. 10009. 10033. 11694. 10022. 10470. 11433. 11428. 11413. 10463. 10471. 10474. 11228. 10014. 10475. 11225. 11233. 11370. 11204. 11435. 10459. 11238. 10304. 11367. 10305. 10001. 10314. 10019. 11222. 10023. 11356. 11235. 10018. 10036. 11106. 10075. 10451. 11366. 10005. 10303. 10455. 11361. 10309. 10013. 11226. 10012. 11224. 10016. 11249. 10039. 10128. 10454. 10010. 11360. 11004. 11691. 10025. 10307. 11232. 10038. 10310. 10040. 11426. 10306. 11362. 11411. 11429. 11422. 10007. 10065. 10021. 10004. 11104. 11231. 11357. 11239. 11363. 10037. 11693. 10280. 11430. 10464. 10006. 11692. 10044. 11001. 10282. 11371. 10281. 11109. 11040. 83. 10020. 10000. 11697. 11251. 10103. 10112. 10069. 11451. 10153. 10041. 11242. 10119. 10048. 10803. 11695. 10111. 10162. 10123.] ____________________________________________________________________________________________________ Column Name: City Total Unique Values: 53 Unique Values: ['NEW YORK' 'ASTORIA' 'BRONX' 'ELMHURST' 'BROOKLYN' 'KEW GARDENS' 'JACKSON HEIGHTS' 'MIDDLE VILLAGE' 'REGO PARK' 'SAINT ALBANS' 'JAMAICA' 'SOUTH RICHMOND HILL' 'RIDGEWOOD' 'HOWARD BEACH' 'FOREST HILLS' 'STATEN ISLAND' 'OZONE PARK' 'RICHMOND HILL' 'WOODHAVEN' 'FLUSHING' 'CORONA' 'QUEENS VILLAGE' 'OAKLAND GARDENS' 'HOLLIS' 'MASPETH' 'EAST ELMHURST' 'SOUTH OZONE PARK' 'WOODSIDE' 'FRESH MEADOWS' 'LONG ISLAND CITY' 'ROCKAWAY PARK' 'SPRINGFIELD GARDENS' 'COLLEGE POINT' 'BAYSIDE' 'GLEN OAKS' 'FAR ROCKAWAY' 'BELLEROSE' 'LITTLE NECK' 'CAMBRIA HEIGHTS' 'ROSEDALE' 'SUNNYSIDE' 'WHITESTONE' 'ARVERNE' 'FLORAL PARK' 'NEW HYDE PARK' 'CENTRAL PARK' 'BREEZY POINT' 'QUEENS' 'Astoria' 'Long Island City' 'Woodside' 'East Elmhurst' 'Howard Beach'] ____________________________________________________________________________________________________ Column Name: Status Total Unique Values: 1 Unique Values: ['Closed'] ____________________________________________________________________________________________________ Column Name: Resolution Description Total Unique Values: 12 Unique Values: ['The Police Department responded and upon arrival those responsible for the condition were gone.' 'The Police Department responded to the complaint and with the information available observed no evidence of the violation at that time.' 'The Police Department responded to the complaint and took action to fix the condition.' 'The Police Department issued a summons in response to the complaint.' 'The Police Department responded to the complaint and determined that police action was not necessary.' 'The Police Department reviewed your complaint and provided additional information below.' 'Your request can not be processed at this time because of insufficient contact information. Please create a new Service Request on NYC.gov and provide more detailed contact information.' "This complaint does not fall under the Police Department's jurisdiction." 'The Police Department responded to the complaint and a report was prepared.' 'The Police Department responded to the complaint but officers were unable to gain entry into the premises.' 'The Police Department made an arrest in response to the complaint.' "Your complaint has been forwarded to the New York Police Department for a non-emergency response. If the police determine the vehicle is illegally parked, they will ticket the vehicle and then you may either contact a private towing company to remove the vehicle or ask your local precinct to contact 'rotation tow'. Any fees charged for towing will have to be paid by the vehicle owner. 311 will have additional information in 8 hours. Please note your service request number for future reference."] ____________________________________________________________________________________________________ Column Name: Borough Total Unique Values: 5 Unique Values: ['MANHATTAN' 'QUEENS' 'BRONX' 'BROOKLYN' 'STATEN ISLAND'] ____________________________________________________________________________________________________ Column Name: Latitude Total Unique Values: 123013 Unique Values: [40.86568154 40.77594531 40.87032452 ... 40.77664592 40.70635259 40.71605291] ____________________________________________________________________________________________________ Column Name: Longitude Total Unique Values: 123112 Unique Values: [-73.92350096 -73.91509394 -73.88852464 ... -73.94880526 -73.87124456 -73.9913785 ] ____________________________________________________________________________________________________ Column Name: Request_Closing_Time Total Unique Values: 53225 Unique Values: <TimedeltaArray> ['0 days 00:55:15', '0 days 01:26:16', '0 days 04:51:31', '0 days 07:45:14', '0 days 03:27:02', '0 days 01:53:30', '0 days 01:57:28', '0 days 01:47:55', '0 days 08:33:02', '0 days 01:23:02', ... '0 days 12:33:42', '0 days 07:28:23', '0 days 05:13:46', '0 days 05:19:11', '0 days 10:22:47', '0 days 09:46:41', '0 days 15:40:46', '0 days 04:44:52', '0 days 09:44:44', '0 days 15:42:26'] Length: 53225, dtype: timedelta64[ns] ____________________________________________________________________________________________________
7. Show summary statistics of sum, mean, standard deviation, skewness, and kurtosis of the data frame.¶
In [34]:
# Set format as so it displays full value
pd.set_option('display.float_format', '{:,.2f}'.format)
# Request Closing Time in Hours
df['Request_Closing_Time_In_Hours'] = df['Request_Closing_Time'].dt.total_seconds() / 3600
# Calculate sum()
print("Sum Calculation:\nColumn Name\tSum")
df[[
'Unique Key',
'Incident Zip',
'Latitude',
'Longitude',
'Request_Closing_Time_In_Hours'
]].sum()
Sum Calculation: Column Name Sum
Out[34]:
Unique Key 9,112,107,955,295.00 Incident Zip 3,160,833,212.00 Latitude 11,855,530.76 Longitude -21,520,095.17 Request_Closing_Time_In_Hours 2,242,774.45 dtype: float64
In [36]:
# Calculate mean
print("Mean Calculation:\nColumn Name\tMean")
df[[
'Unique Key',
'Incident Zip',
'Latitude',
'Longitude',
'Request_Closing_Time'
]].mean()
Mean Calculation: Column Name Mean
Out[36]:
Unique Key 31,301,576.24 Incident Zip 10,857.98 Latitude 40.73 Longitude -73.93 Request_Closing_Time 0 days 07:42:15.464980230 dtype: object
In [38]:
# Set format back to default
pd.reset_option('display.float_format')
In [40]:
# Calculate standard deviation
print("Standard Deviation Calculation:\nColumn Name\tStd")
df.std(numeric_only=True)
Standard Deviation Calculation: Column Name Std
Out[40]:
Unique Key 575377.738707 Incident Zip 580.280774 Latitude 0.082411 Longitude 0.078654 Request_Closing_Time_In_Hours 567.208468 dtype: float64
In [42]:
# Calculate Skewness
print("Skewness Calculation:\nColumn Name\tSkewness")
df.skew(numeric_only=True)
Skewness Calculation: Column Name Skewness
Out[42]:
Unique Key 0.016898 Incident Zip -2.553956 Latitude 0.123114 Longitude -0.312739 Request_Closing_Time_In_Hours -8.142360 dtype: float64
In [44]:
# Calculate Kurtosis
print("Kurtosis Calculation:\nColumn Name\tKurtosis")
df.kurt(numeric_only=True)
Kurtosis Calculation: Column Name Kurtosis
Out[44]:
Unique Key -1.176593 Incident Zip 37.827777 Latitude -0.734818 Longitude 1.455600 Request_Closing_Time_In_Hours 87.306717 dtype: float64
8. Calculate and show correlation of all variables.¶
In [47]:
# importing seaborn and matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
In [48]:
# Filter relevant numeric columns
correlation_df = df[[
'Unique Key',
'Incident Zip',
'Latitude',
'Longitude',
'Request_Closing_Time'
]]
In [49]:
# Compute correlation matrix
corr_matrix = correlation_df.corr()
corr_matrix
Out[49]:
| Unique Key | Incident Zip | Latitude | Longitude | Request_Closing_Time | |
|---|---|---|---|---|---|
| Unique Key | 1.000000 | 0.025492 | -0.032613 | -0.008621 | 0.004666 |
| Incident Zip | 0.025492 | 1.000000 | -0.499081 | 0.385934 | -0.002069 |
| Latitude | -0.032613 | -0.499081 | 1.000000 | 0.368819 | 0.002606 |
| Longitude | -0.008621 | 0.385934 | 0.368819 | 1.000000 | 0.005605 |
| Request_Closing_Time | 0.004666 | -0.002069 | 0.002606 | 0.005605 | 1.000000 |
In [50]:
# Plot correlation heatmap
sns.heatmap(corr_matrix, annot=True, fmt=".3f", cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()
9. Four major insights through visualization that you come up after data mining.¶
*INSIGHT 1: Spatial Density by Complaint Types*
In [53]:
# Spatial Density by Complaint Types
# List of selected complaint types
complaint_types = list(df['Complaint Type'].unique())
# Loop to create a hexbin plot for each complaint type
for complaint in complaint_types:
subset = df[df['Complaint Type'] == complaint]
# Plot hexbar with appropriate configurations
plt.figure(figsize=(12, 10))
plt.hexbin(
subset['Longitude'],
subset['Latitude'],
gridsize=150,
cmap='viridis',
bins='log',
mincnt=1
)
plt.colorbar(label='Log(Count)')
plt.title(f'Spatial Density of: {complaint}', fontsize=16)
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.grid(True)
# Fix vertical stretching
plt.gca().set_aspect('equal', adjustable='box')
plt.tight_layout()
plt.show()
*INSIGHT 2: Complaint Distribution by Borough*
In [55]:
# Storing borough counts
borough_counts = df["Borough"].value_counts()
# Plot pie chart with appropriate configurations
plt.figure(figsize=(7, 7))
plt.pie(borough_counts, labels=borough_counts.index, autopct='%1.1f%%', startangle=140)
plt.title("Insight 2: Complaint Distribution by Borough")
plt.tight_layout()
plt.show()
*Insight 3: Requests by Hour of Day*
In [57]:
# Created hour of the day
df['Created Hour'] = df['Created Date'].dt.hour
# Histogram of requests by hour
sns.histplot(df['Created Hour'], bins=24, color='purple')
plt.title("Insight 3: Requests by Hour of Day")
plt.xlabel("Hour")
plt.ylabel("Number of Requests")
plt.xticks(range(24))
plt.show()
*Insight 4: Number of Complaints*
In [59]:
# Number of complaints of each complaint type
complaints = df["Complaint Type"].value_counts()
# Plot bar graph of most common complaint types
plt.figure(figsize=(12, 6))
sns.barplot(x=complaints.values, y=complaints.index, palette="flare")
plt.title("Insight 4: Most Common Complaint Types")
plt.xlabel("Number of Complaints")
plt.ylabel("Complaint Type")
plt.tight_layout()
plt.show()
C:\Users\Anshu\AppData\Local\Temp\ipykernel_12536\1526219640.py:6: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect. sns.barplot(x=complaints.values, y=complaints.index, palette="flare")
10. Arranging the complaint types according to their average 'Request_Closing_Time', categorized by various locations.¶
In [64]:
# Create 'Request_Closing_Time' in hours
df = df[df['Request_Closing_Time_In_Hours'] >= 0]
# Group by Complaint Type and Borough and calculate mean closing time
grouped_df = df.groupby(['Complaint Type', 'Borough'])['Request_Closing_Time_In_Hours'].mean().reset_index()
# Pivot the data for visualization
pivot_df = grouped_df.pivot(index='Complaint Type', columns='Borough', values='Request_Closing_Time_In_Hours')
# Select top 10 complaint types for better readability
top_complaints = df['Complaint Type'].value_counts().index
pivot_df = pivot_df.loc[pivot_df.index.intersection(top_complaints)]
# Plotting grouped bar chart
pivot_df.plot(kind='bar', figsize=(14, 7), colormap='tab10')
plt.title('Average Request Closing Time by Complaint Type and Borough')
plt.ylabel('Avg Request Closing Time (Hours)')
plt.xlabel('Complaint Type')
plt.xticks(rotation=45, ha='right')
plt.legend(title='Borough')
plt.tight_layout()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
In [66]:
import plotly.express as px
# Group by Complaint Type and Borough and calculate mean closing time
grouped_df = df.groupby(['Complaint Type', 'Location Type'])['Request_Closing_Time_In_Hours'].mean().reset_index()
# Create a treemap to visualize average request closing times
# broken down by Location Type and Complaint Type
fig = px.treemap(
grouped_df,
path=['Location Type', 'Complaint Type'], # Hierarchical path for grouping
values='Request_Closing_Time_In_Hours', # Size of each block based on avg closing time
color='Request_Closing_Time_In_Hours', # Color intensity based on avg closing time
color_continuous_scale='Blues', # Use a blue color gradient
title='Avg Request Closing Time by Location Type & Complaint Type'
)
# Customize the figure layout to increase visual clarity
fig.update_layout(
width=1000, # Set figure width
height=700, # Set figure height
)
# Display the interactive treemap
fig.show()
11. Statistical Tests¶
*Test 1: Average response time across complaint types is similar or not.*
In [68]:
from scipy.stats import f_oneway
# Use top 10 complaint types for reliability
top_types = df['Complaint Type'].value_counts().index
filtered_df = df[df['Complaint Type'].isin(top_types)]
# Group by complaint type and prepare numeric durations
groups = [
group['Request_Closing_Time_In_Hours'].dropna()
for name, group in filtered_df.groupby('Complaint Type')
]
# Perform ANOVA test
f_stat, p_value = f_oneway(*groups)
print("ANOVA F-statistics:", f_stat)
print("ANOVA p-value:", p_value)
if p_value < 0.05:
print("Reject Ho: Complaint Type and Average Response Time are associated.")
else:
print("Fail to reject Ho: No association.")
ANOVA F-statistics: 50.06668185699929 ANOVA p-value: 2.462779063124107e-140 Reject Ho: Complaint Type and Average Response Time are associated.
*Test 2: Type of complaint or service requested and location are related*
In [70]:
from scipy.stats import chi2_contingency
# Create a contingency table
contingency_table = pd.crosstab(df['Complaint Type'], df['Borough'])
contingency_table
Out[70]:
| Borough | BRONX | BROOKLYN | MANHATTAN | QUEENS | STATEN ISLAND |
|---|---|---|---|---|---|
| Complaint Type | |||||
| Animal Abuse | 1386 | 2362 | 1500 | 1843 | 544 |
| Blocked Driveway | 12540 | 27830 | 2036 | 31179 | 2110 |
| Derelict Vehicle | 1916 | 5096 | 527 | 7959 | 1744 |
| Disorderly Youth | 59 | 71 | 66 | 59 | 21 |
| Drinking | 181 | 255 | 290 | 348 | 174 |
| Graffiti | 8 | 42 | 22 | 36 | 2 |
| Illegal Parking | 7689 | 27076 | 11902 | 21587 | 4807 |
| Noise - Commercial | 2394 | 11306 | 14286 | 5986 | 665 |
| Noise - House of Worship | 75 | 334 | 188 | 297 | 17 |
| Noise - Park | 512 | 1512 | 1152 | 626 | 67 |
| Noise - Street/Sidewalk | 8657 | 13109 | 20143 | 4326 | 793 |
| Noise - Vehicle | 3298 | 5081 | 5332 | 2565 | 350 |
| Posting Advertisement | 16 | 45 | 41 | 30 | 500 |
| Traffic | 344 | 1081 | 1523 | 1283 | 196 |
| Vending | 375 | 512 | 2373 | 469 | 25 |
In [71]:
# Perform chi-square test
chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)
print("Chi-square Statistic:", chi2_stat)
print("Chi-square p-value:", p_value)
if p_value < 0.05:
print("Reject Ho: Complaint Type and Borough are associated.")
else:
print("Fail to reject Ho: No association.")
Chi-square Statistic: 72029.76460562609 Chi-square p-value: 0.0 Reject Ho: Complaint Type and Borough are associated.
---THE END---
In [ ]: